Logistic Regression is a Machine Learning algorithm used to predict the value of a dependent variable, such as the condition of a tumor (malignant or benign), the classification of an email (spam or not spam), or admission to a university (admitted or not admitted), by learning from independent variables (the various features relevant to the problem).
For example, for classifying an email, the algorithm will use the words in the email as features and based on that make a prediction whether the email is spam or not.
Logistic Regression is a supervised Machine Learning algorithm, which means the data provided for training is labeled, i.e., the answers are already provided in the training set. The algorithm learns from those examples and their corresponding answers (labels) and then uses that knowledge to classify new examples.
In mathematical terms, suppose the dependent variable is Y and the set of independent variables is X. Logistic regression then models P(Y=1), the probability that the dependent variable takes the value 1, as a function of X.
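In other words, the model passes a linear combination of the features through the sigmoid (logistic) function to turn a real-valued score into a probability. A minimal sketch, using made-up weights purely for illustration:

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A linear score w·x + b is squashed into P(Y=1 | X)
w = np.array([0.8, -0.4])   # illustrative weights
b = 0.1                     # illustrative intercept
x = np.array([2.0, 1.0])    # one example's features

p = sigmoid(w @ x + b)      # estimated probability that Y = 1
print(round(p, 3))          # → 0.786
```

Scores far below zero map near 0, scores far above zero map near 1, and a score of exactly zero maps to 0.5, which is why 0.5 is the usual decision threshold.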
Logistic Regression can be divided into types based on the kind of classification it performs. There are 3 types: binary (two possible outcomes, e.g., spam or not spam), multinomial (three or more unordered categories), and ordinal (three or more ordered categories).
The major difference between Logistic and Linear Regression is that Linear Regression is used to solve regression problems, whereas Logistic Regression is used for classification problems. In regression problems, the target variable can take continuous values, such as the price of a product or the age of a participant. Classification problems, on the other hand, deal with target variables that can only take discrete values, for example, the gender of a person, or whether a tumor is malignant or benign.
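The contrast can be seen directly in scikit-learn, fitting the same toy feature against a continuous target and a discrete one. The data here is purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.arange(10).reshape(-1, 1)          # single feature, values 0..9
y_cont = 3.0 * X.ravel() + 2.0            # continuous target (regression)
y_cls = (X.ravel() >= 5).astype(int)      # discrete target (classification)

reg = LinearRegression().fit(X, y_cont)   # predicts a real number
clf = LogisticRegression().fit(X, y_cls)  # predicts a class label

print(reg.predict([[4.5]]))   # a continuous value (here exactly 15.5)
print(clf.predict([[4.5]]))   # a class label: 0 or 1, nothing in between
```

The regressor can output any real number, while the classifier's output is always one of the discrete labels it was trained on.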
To perform the logistic regression, we shall carry out the following steps:
a. Import the required libraries. If any of them is not installed, it can be installed with the pip module.
b. Import and preprocess the data
c. Perform exploratory visualization
d. Build and train the model
e. Evaluate the model
f. Conclusion
import pandas as pd # data processing
import numpy as np # working with arrays
import itertools # construct specialized tools
import matplotlib.pyplot as plt # visualizations
import seaborn as sns
from matplotlib import rcParams # plot size customization
from termcolor import colored as cl # text customization
from sklearn.model_selection import train_test_split # splitting the data
from sklearn.linear_model import LogisticRegression # model algorithm
from sklearn.preprocessing import StandardScaler # data normalization
from sklearn.metrics import jaccard_score # evaluation metric
from sklearn.metrics import accuracy_score # evaluation metric
from sklearn.metrics import precision_score # evaluation metric
from sklearn.metrics import classification_report # evaluation metric
from sklearn.metrics import confusion_matrix # evaluation metric
from sklearn.metrics import log_loss # evaluation metric
rcParams['figure.figsize'] = (20, 10)
## Load data set
df = pd.read_csv("C:/Users/IZZYLYF/OneDrive/hamoyeGIT/Machine Learning Classification - Managing the Quality Metric of Global Ecological Footprint/elecgrid.csv")
The data has been read in as 'df'. It has 10000 rows and 14 features recorded as columns. A statistical description of the data set shows that 'stabf' is a factor variable with two levels: stable and unstable. Thirteen variables are floats and one is an object, and every row is unique in its value recordings. The data set is 100% complete, with no missing, null, or duplicate entries. The target variable 'stabf' has 6380 unstable records and 3620 stable records. The 'stab' feature will be dropped because it is directly related to the target variable.
print(df.shape)
(10000, 14)
print(df)
      tau1      tau2      tau3      tau4        p1        p2        p3  \
0 2.959060 3.079885 8.381025 9.780754 3.763085 -0.782604 -1.257395
1 9.304097 4.902524 3.047541 1.369357 5.067812 -1.940058 -1.872742
2 8.971707 8.848428 3.046479 1.214518 3.405158 -1.207456 -1.277210
3 0.716415 7.669600 4.486641 2.340563 3.963791 -1.027473 -1.938944
4 3.134112 7.608772 4.943759 9.857573 3.525811 -1.125531 -1.845975
... ... ... ... ... ... ... ...
9995 2.930406 9.487627 2.376523 6.187797 3.343416 -0.658054 -1.449106
9996 3.392299 1.274827 2.954947 6.894759 4.349512 -1.663661 -0.952437
9997 2.364034 2.842030 8.776391 1.008906 4.299976 -1.380719 -0.943884
9998 9.631511 3.994398 2.757071 7.821347 2.514755 -0.966330 -0.649915
9999 6.530527 6.781790 4.349695 8.673138 3.492807 -1.390285 -1.532193
        p4        g1        g2        g3        g4      stab     stabf
0 -1.723086 0.650456 0.859578 0.887445 0.958034 0.055347 unstable
1 -1.255012 0.413441 0.862414 0.562139 0.781760 -0.005957 stable
2 -0.920492 0.163041 0.766689 0.839444 0.109853 0.003471 unstable
3 -0.997374 0.446209 0.976744 0.929381 0.362718 0.028871 unstable
4 -0.554305 0.797110 0.455450 0.656947 0.820923 0.049860 unstable
... ... ... ... ... ... ... ...
9995 -1.236256 0.601709 0.779642 0.813512 0.608385 0.023892 unstable
9996 -1.733414 0.502079 0.567242 0.285880 0.366120 -0.025803 stable
9997 -1.975373 0.487838 0.986505 0.149286 0.145984 -0.031810 stable
9998 -0.898510 0.365246 0.587558 0.889118 0.818391 0.037789 unstable
9999 -0.570329 0.073056 0.505441 0.378761 0.942631 0.045263 unstable
[10000 rows x 14 columns]
print(sorted(df))
['g1', 'g2', 'g3', 'g4', 'p1', 'p2', 'p3', 'p4', 'stab', 'stabf', 'tau1', 'tau2', 'tau3', 'tau4']
print(df.nunique())
tau1     10000
tau2     10000
tau3     10000
tau4     10000
p1       10000
p2       10000
p3       10000
p4       10000
g1       10000
g2       10000
g3       10000
g4       10000
stab     10000
stabf        2
dtype: int64
df.head()
|   | tau1 | tau2 | tau3 | tau4 | p1 | p2 | p3 | p4 | g1 | g2 | g3 | g4 | stab | stabf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.959060 | 3.079885 | 8.381025 | 9.780754 | 3.763085 | -0.782604 | -1.257395 | -1.723086 | 0.650456 | 0.859578 | 0.887445 | 0.958034 | 0.055347 | unstable |
| 1 | 9.304097 | 4.902524 | 3.047541 | 1.369357 | 5.067812 | -1.940058 | -1.872742 | -1.255012 | 0.413441 | 0.862414 | 0.562139 | 0.781760 | -0.005957 | stable |
| 2 | 8.971707 | 8.848428 | 3.046479 | 1.214518 | 3.405158 | -1.207456 | -1.277210 | -0.920492 | 0.163041 | 0.766689 | 0.839444 | 0.109853 | 0.003471 | unstable |
| 3 | 0.716415 | 7.669600 | 4.486641 | 2.340563 | 3.963791 | -1.027473 | -1.938944 | -0.997374 | 0.446209 | 0.976744 | 0.929381 | 0.362718 | 0.028871 | unstable |
| 4 | 3.134112 | 7.608772 | 4.943759 | 9.857573 | 3.525811 | -1.125531 | -1.845975 | -0.554305 | 0.797110 | 0.455450 | 0.656947 | 0.820923 | 0.049860 | unstable |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   tau1    10000 non-null  float64
 1   tau2    10000 non-null  float64
 2   tau3    10000 non-null  float64
 3   tau4    10000 non-null  float64
 4   p1      10000 non-null  float64
 5   p2      10000 non-null  float64
 6   p3      10000 non-null  float64
 7   p4      10000 non-null  float64
 8   g1      10000 non-null  float64
 9   g2      10000 non-null  float64
 10  g3      10000 non-null  float64
 11  g4      10000 non-null  float64
 12  stab    10000 non-null  float64
 13  stabf   10000 non-null  object 
dtypes: float64(13), object(1)
memory usage: 1.1+ MB
sum(df.duplicated())
0
#to determine the missing entries
df.isnull()
|   | tau1 | tau2 | tau3 | tau4 | p1 | p2 | p3 | p4 | g1 | g2 | g3 | g4 | stab | stabf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 9996 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 9997 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 9998 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 9999 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
10000 rows × 14 columns
# Count total NaN at each column in a DataFrame
print(" \nCount total NaN at each column in a DataFrame : \n\n",
df.isnull().sum())
Count total NaN at each column in a DataFrame : 

tau1     0
tau2     0
tau3     0
tau4     0
p1       0
p2       0
p3       0
p4       0
g1       0
g2       0
g3       0
g4       0
stab     0
stabf    0
dtype: int64
#drop features closely related to the target feature
df.drop('stab', axis='columns', inplace=True)
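The reason for the drop is worth making explicit: in this data set, 'stabf' appears to be just the sign of 'stab' (negative stab → stable), so keeping 'stab' would leak the answer to the model. A quick check of that pattern, sketched here on a tiny stand-in frame built from the rows shown in the preview above:

```python
import pandas as pd

# Stand-in with stab/stabf values copied from the data preview
demo = pd.DataFrame({
    'stab':  [0.055347, -0.005957, 0.003471, -0.025803],
    'stabf': ['unstable', 'stable', 'unstable', 'stable'],
})

# Derive the label from the sign of 'stab' and compare to 'stabf'
derived = demo['stab'].lt(0).map({True: 'stable', False: 'unstable'})
print((derived == demo['stabf']).all())   # → True on this sample
```

In the notebook itself the same comparison could be run on the full `df` before the drop.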
#check distribution of target variable
df['stabf'].value_counts()
unstable    6380
stable      3620
Name: stabf, dtype: int64
print(df.shape)
(10000, 13)
df.hist(figsize=(14,14), xrot=45)
plt.show()
sns.pairplot(df,hue='stabf',height=4)
<seaborn.axisgrid.PairGrid at 0x227e2c1b3a0>
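Steps d and e of the outline above (model building, training, and evaluation) can be sketched as follows. Since this sketch must run on its own, it uses a small synthetic stand-in frame `demo`; in the notebook, the loaded `df` with 'stabf' as the target would be used instead:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic stand-in for the grid data (replace with the real df)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(200, 4)),
                    columns=['tau1', 'p1', 'g1', 'g2'])
demo['stabf'] = np.where(demo['tau1'] + demo['g1'] > 0,
                         'unstable', 'stable')

X = demo.drop('stabf', axis=1)
y = demo['stabf']

# Hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Standardize features; fit the scaler on the training set only
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Train the logistic regression model and evaluate on held-out data
model = LogisticRegression()
model.fit(X_train_s, y_train)
y_pred = model.predict(X_test_s)

print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```

Fitting the scaler on the training split alone avoids leaking information from the test set into the preprocessing step.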